 taxonomic classification



BIOSCAN-5M: A Multimodal Dataset for Insect Biodiversity

Neural Information Processing Systems

Biodiversity plays a multifaceted role in sustaining ecosystems and supporting human well-being. Primarily, it serves as a cornerstone for ecosystem stability and resilience, providing a natural defence against disturbances such as climate change and invasive species (Cardinale et al., 2012).


Taxonomic Reasoning for Rare Arthropods: Combining Dense Image Captioning and RAG for Interpretable Classification

arXiv.org Artificial Intelligence

In the context of pressing climate change challenges and the significant biodiversity loss among arthropods, automated taxonomic classification from organismal images is a subject of intense research. However, traditional AI pipelines based on deep neural visual architectures such as CNNs or ViTs face limitations such as degraded performance on the long tail of classes and the inability to reason about their predictions. We integrate image captioning and retrieval-augmented generation (RAG) with large language models (LLMs) to enhance biodiversity monitoring, showing particular promise for characterizing rare and unknown arthropod species. While a naive Vision-Language Model (VLM) excels in classifying images of common species, the RAG model enables classification of rarer taxa by matching explicit textual descriptions of taxonomic features to contextual biodiversity text data from external sources. The RAG model shows promise in reducing overconfidence and enhancing accuracy relative to naive LLMs, suggesting its viability in capturing the nuances of taxonomic hierarchy, particularly at the challenging family and genus levels. Our findings highlight the potential for modern vision-language AI pipelines to support biodiversity conservation initiatives, emphasizing the role of comprehensive data curation and collaboration with citizen science platforms to improve species identification, unknown species characterization and ultimately inform conservation strategies.


BIOSCAN-5M: A Multimodal Dataset for Insect Biodiversity

arXiv.org Artificial Intelligence

As part of an ongoing worldwide effort to comprehend and monitor insect biodiversity, this paper presents the BIOSCAN-5M Insect dataset to the machine learning community and establishes several benchmark tasks. BIOSCAN-5M is a comprehensive dataset containing multi-modal information for over 5 million insect specimens, and it significantly expands existing image-based biological datasets by including taxonomic labels, raw nucleotide barcode sequences, assigned barcode index numbers, and geographical information. We propose three benchmark experiments to demonstrate the impact of the multi-modal data types on the classification and clustering accuracy. First, we pretrain a masked language model on the DNA barcode sequences of the BIOSCAN-5M dataset, and demonstrate the impact of using this large reference library on species- and genus-level classification performance. Second, we propose a zero-shot transfer learning task applied to images and DNA barcodes to cluster feature embeddings obtained from self-supervised learning, to investigate whether meaningful clusters can be derived from these representation embeddings. Third, we benchmark multi-modality by performing contrastive learning on DNA barcodes, image data, and taxonomic information. This yields a general shared embedding space enabling taxonomic classification using multiple types of information and modalities. The code repository of the BIOSCAN-5M Insect dataset is available at https://github.com/zahrag/BIOSCAN-5M.
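
The contrastive step mentioned in this abstract can be illustrated with a generic symmetric InfoNCE-style loss over paired embeddings. This is only a minimal sketch under assumptions: the function names, the temperature value, and the use of two modalities are illustrative choices, not the BIOSCAN-5M implementation, which the abstract does not describe in detail.

```python
import math


def _normalize(vector):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def info_nce(queries, keys, temperature=0.1):
    """Mean cross-entropy in which queries[i] should match keys[i].

    The temperature is an assumed hyperparameter, not a value from the abstract.
    """
    q = [_normalize(v) for v in queries]
    k = [_normalize(v) for v in keys]
    total = 0.0
    for i, qi in enumerate(q):
        logits = [sum(a * b for a, b in zip(qi, kj)) / temperature for kj in k]
        peak = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denominator - logits[i]
    return total / len(q)


def symmetric_contrastive_loss(images, barcodes, temperature=0.1):
    """Average the image-to-barcode and barcode-to-image losses."""
    return 0.5 * (info_nce(images, barcodes, temperature)
                  + info_nce(barcodes, images, temperature))
```

Correctly paired embeddings should give a loss near zero, while mismatched pairs are penalised; a third modality such as taxonomic text could be added, for example, by summing the pairwise losses.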


Deep Visual-Genetic Biometrics for Taxonomic Classification of Rare Species

arXiv.org Artificial Intelligence

Visual as well as genetic biometrics are routinely employed to identify species and individuals in biological applications. However, no attempts have been made in this domain to computationally enhance visual classification of rare classes with little image data via genetics. In this paper, we thus propose aligned visual-genetic inference spaces with the aim to implicitly encode cross-domain associations for improved performance. We demonstrate for the first time that such alignment can be achieved via deep embedding models and that the approach is directly applicable to boosting long-tailed recognition (LTR) particularly for rare species. We experimentally demonstrate the efficacy of the concept via application to microscopic imagery of 30k+ planktic foraminifer shells across 32 species when used together with independent genetic data samples. Most importantly for practitioners, we show that visual-genetic alignment can significantly benefit visual-only recognition of the rarest species. Technically, we pre-train a visual ResNet50 deep learning model using triplet loss formulations to create an initial embedding space. We re-structure this space based on genetic anchors embedded via a Sequence Graph Transform (SGT) and linked to visual data by cross-domain cosine alignment. We show that an LTR approach improves the state-of-the-art across all benchmarks and that adding our visual-genetic alignment improves per-class and particularly rare tail class benchmarks significantly further. We conclude that visual-genetic alignment can be a highly effective tool for complementing visual biological data containing rare classes. The concept proposed may serve as an important future tool for integrating genetics and imageomics towards a more complete scientific representation of taxonomic spaces and life itself. Code, weights, and data splits are published for full reproducibility.
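
The two loss ingredients named in this abstract, a triplet loss and a cross-domain cosine alignment, can be sketched generically as follows. This is an illustrative sketch, not the authors' code: the margin value, the use of cosine distance inside the triplet loss, and all function names are assumptions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge loss on cosine distances: max(0, d(a, p) - d(a, n) + margin).

    The margin of 0.2 is an assumed value for illustration.
    """
    d_pos = 1.0 - cosine_similarity(anchor, positive)
    d_neg = 1.0 - cosine_similarity(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)


def alignment_loss(visual_embedding, genetic_anchor):
    """Cosine distance pulling a visual embedding towards its class's genetic anchor."""
    return 1.0 - cosine_similarity(visual_embedding, genetic_anchor)
```

In a training loop these terms would typically be summed with some weighting and minimised over the visual encoder's parameters; the weighting is not specified in the abstract.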


To Classify is to Interpret: Building Taxonomies from Heterogeneous Data through Human-AI Collaboration

arXiv.org Artificial Intelligence

Taxonomies serve this purpose as structured classification schemes that adhere to domain-specific standards. The importance of organizing, segmenting, and classifying data is especially obvious in light of the ever-growing amount of information that is being created, aggregated, and made available through specialized data repositories or on the Internet. Given the amount and heterogeneity of the available data, classification can hardly be addressed by means of manual-cognitive processing alone. Systems that integrate machine learning (ML) are able to process large amounts of data and, thus, can help with the task of classification and organization. However, delegating this task to ML-based systems in its entirety would mean that we sideline human interpretation and rely on the output of black-boxed systems that reproduce language ideologies and representational harms (see, e.g., [5]). As an attempt to highlight the interpretative character of classification and taxonomy building, we propose to design ML-based systems that enable human-AI collaboration. Such systems are designed with the goal of effectively combining human competencies and computational capabilities (see, e.g., [27, 29]). Our approach enables domain experts to iteratively interact with the suggestions of the system while retaining interpretative authority. We report on the concept and implementation of this approach that we realized for two real-world use cases.


Zero-phase angle asteroid taxonomy classification using unsupervised machine learning algorithms

arXiv.org Artificial Intelligence

We are in an era of large catalogs and, thus, statistical analysis tools for large data sets, such as machine learning, play a fundamental role. One example of such a survey is the Sloan Moving Object Catalog (MOC), which lists the astrometric and photometric information of all moving objects captured by the Sloan field of view. One great advantage of this telescope is represented by its set of five filters, allowing for taxonomic analysis of asteroids by studying their colors. However, until now, the color variation produced by the change of phase angle of the object has not been taken into account. In this paper, we address this issue by using absolute magnitudes for classification. We aim to produce a new taxonomic classification of asteroids based on their magnitudes that is unaffected by variations caused by the change in phase angle. We selected 9481 asteroids with absolute magnitudes of Hg, Hi and Hz, computed from the Sloan Moving Objects Catalog using the HG12 system. We calculated the absolute colors with them. To perform the taxonomic classification, we applied an unsupervised machine learning algorithm known as fuzzy C-means. This is a useful soft clustering tool for working with data sets where the different groups are not completely separated and there are regions of overlap between them. We have chosen to work with the four main taxonomic complexes, C, S, X, and V, as they comprise most of the known spectral characteristics. We classified a total of 6329 asteroids with more than 60% probability of belonging to the assigned taxonomic class, with 162 of these objects having been characterized by an ambiguous classification in the past. By analyzing the sample obtained in the plane of semimajor axis versus inclination, we identified 15 new V-type asteroid candidates outside the Vesta family region.
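
The soft clustering described in this abstract can be sketched with a textbook fuzzy C-means loop followed by a membership threshold. This is a minimal illustration under assumptions: the fuzzifier `m=2.0`, the random initialisation, the stopping rule, and the function names are generic choices, not the authors' pipeline; only the 60% membership cut-off comes from the abstract.

```python
import random


def fuzzy_c_means(points, n_clusters, m=2.0, n_iter=100, tol=1e-6, seed=0):
    """Return (centers, memberships) for a list of equal-length numeric tuples."""
    rng = random.Random(seed)
    dim = len(points[0])
    # Random initial memberships, each row normalised to sum to 1.
    u = []
    for _ in points:
        row = [rng.random() + 1e-9 for _ in range(n_clusters)]
        total = sum(row)
        u.append([v / total for v in row])
    centers = []
    for _ in range(n_iter):
        # Cluster centers are means weighted by membership ** m.
        centers = []
        for k in range(n_clusters):
            weights = [u[i][k] ** m for i in range(len(points))]
            wsum = sum(weights)
            centers.append(tuple(
                sum(w * p[d] for w, p in zip(weights, points)) / wsum
                for d in range(dim)))
        # Memberships are proportional to distance ** (-2 / (m - 1)).
        shift = 0.0
        for i, p in enumerate(points):
            dists = [max(sum((a - b) ** 2 for a, b in zip(p, c)) ** 0.5, 1e-12)
                     for c in centers]
            inv = [d ** (-2.0 / (m - 1.0)) for d in dists]
            total = sum(inv)
            new_row = [v / total for v in inv]
            shift = max(shift, max(abs(a - b) for a, b in zip(new_row, u[i])))
            u[i] = new_row
        if shift < tol:  # stop once memberships no longer change
            break
    return centers, u


def confident_labels(memberships, threshold=0.6):
    """Assign a cluster index only when the top membership exceeds the threshold."""
    labels = []
    for row in memberships:
        best = max(range(len(row)), key=row.__getitem__)
        labels.append(best if row[best] > threshold else None)
    return labels
```

For the setting in the abstract, `points` would hold the absolute colors of each asteroid and `n_clusters` would be 4 for the C, S, X, and V complexes; objects returned as `None` are those left unclassified by the threshold.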


Two theses of knowledge representation: Language restrictions, taxonomic classification, and the utility of representation services

Classics

Levesque and Brachman argue that in order to provide timely and correct responses in the most critical applications, general-purpose knowledge representation systems should restrict their languages by omitting constructs which require nonpolynomial worst-case response times for sound and complete classification. They also separate terminological and assertional knowledge, and restrict classification to purely terminological information. We demonstrate that restricting the terminological language and classifier in these ways limits these “general-purpose” facilities so severely that they are no longer generally applicable. We argue that logical soundness, completeness, and worst-case complexity are inadequate measures for evaluating the utility of representation services, and that this evaluation should employ the broader notions of utility and rationality found in decision theory. We suggest that general-purpose representation services should provide fully expressive languages, classification over relevant contingent information, “approximate” forms of classification involving defaults, and rational management of inference tools.